Hands-on Exercise 3.1: Processing and Visualising Flow Data
15.1 Overview
Spatial interaction represent the flow of people, material, or information between locations in geographical space. It encompasses everything from freight shipments, energy flows, and the global trade in rare antiquities, to flight schedules, rush hour woes, and pedestrian foot traffic.
Each spatial interaction, as an analogy for a set of movements, is composed of a discrete origin/destination pair. Each pair can be represented as a cell in a matrix where rows are related to the locations (centroids) of origin, while columns are related to locations (centroids) of destination. Such a matrix is commonly known as an origin/destination matrix, or a spatial interaction matrix.
In this hands-on exercise, build an OD matrix by using Passenger Volume by Origin Destination Bus Stops data set downloaded from LTA DataMall.
By the end to this hands-on exercise, we will be able to:
to import and extract OD data for a selected time interval,
to import and save geospatial data (i.e. bus stops and mpsz) into sf tibble data frame objects,
to populate planning subzone code into bus stops sf tibble data frame,
to construct desire lines geospatial data from the OD data, and
to visualise passenger volume by origin and destination bus stops by using the desire lines data.
15.2 Getting Started
For the purpose of this exercise, four R packages will be used. They are:
sf for importing, integrating, processing and transforming geospatial data.
tidyverse for importing, integrating, wrangling and visualising data.
tmap for creating thematic maps
stplanr for solving common problems in transport planning and modelling, such as how to best get from point A to point B
ggpubr for some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.
performance for for computing measures to assess model quality, which are not directly provided by R’s ‘base’ or ‘stats’ packages. The primary goal of the performance package is to provide utilities for computing indices of model quality and goodness of fit. These include measures like r-squared (R2), root mean squared error (RMSE)
15.3 Preparing the Flow Data
15.3.1 Importing the OD data
Import the Passenger Volume by Origin Destination Bus Stops data set downloaded from LTA DataMall by using read_csv() of readr package.
Display the odbus tibble data table by using the code chunk below.
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
A quick check of odbus tibble data frame shows that the values in ORIGIN_PT_CODE and DESTINATON_PT_CODE are in numeric data type. Hence, the code chunk below is used to convert these data values into character data type.
15.3.2 Extracting the study data
The data in odbus is generalised into weekend and weekday data. For the purpose of this exercise, we will extract commuting flows on weekday and between 6 and 9 o’clock. After the group-by and sum, the total rows reduced from 5,709,512 ro 241,503.
Print the content of odbus6_9
If we would like to, we could save the output in rds format for future use. We need to ensure that there exists a folder called ‘rds’ in ‘data’ folder before running the code chunk.
To read rds files:
15.4 Working with Geospatial Data
In this exercise, two geospatial data will be used. They are:
BusStop: This data provides the location of bus stop as at the third quarter of 2023. This data is refreshed quarterly by LTA. The last update was in July 2023.
MPSZ-2019: This data provides the sub-zone boundary of URA Master Plan 2019.
Both data sets are in ESRI shapefile format.
15.4.1 Importing geospatial data
Load the BusStop geospatial data using the st_read() function of sf package. Using the st_crs(busstop) will show that the coordinate system used is WSG84 (decimal deg). Using st_tranform(), we will convert the geographical coordinates system to SIngapore’s projected coordinate system crs=3414.
Note that there are repeated bus stop ids , however they have different bus stop roof ids and geometry values.
busstop <- st_read(dsn = "data/geospatial/BusStopLocation/BusStopLocation_Jul2023",
layer = "BusStop") %>%
st_transform(crs = 3414)Reading layer `BusStop' from data source
`C:\yixin-neo\ISSS624_AGA\Hands-on_Ex3\data\geospatial\BusStopLocation\BusStopLocation_Jul2023'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Next load the mpsz data. There are 332 planning subzones in Singapore.
mpsz <- st_read(dsn = "data/geospatial/MPSZ-2019",
layer = "MPSZ-2019") %>%
st_transform(crs = 3414)Reading layer `MPSZ-2019' from data source
`C:\yixin-neo\ISSS624_AGA\Hands-on_Ex3\data\geospatial\MPSZ-2019'
using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
First 10 features:
SUBZONE_N SUBZONE_C PLN_AREA_N PLN_AREA_C REGION_N
1 MARINA EAST MESZ01 MARINA EAST ME CENTRAL REGION
2 INSTITUTION HILL RVSZ05 RIVER VALLEY RV CENTRAL REGION
3 ROBERTSON QUAY SRSZ01 SINGAPORE RIVER SR CENTRAL REGION
4 JURONG ISLAND AND BUKOM WISZ01 WESTERN ISLANDS WI WEST REGION
5 FORT CANNING MUSZ02 MUSEUM MU CENTRAL REGION
6 MARINA EAST (MP) MPSZ05 MARINE PARADE MP CENTRAL REGION
7 SUDONG WISZ03 WESTERN ISLANDS WI WEST REGION
8 SEMAKAU WISZ02 WESTERN ISLANDS WI WEST REGION
9 SOUTHERN GROUP SISZ02 SOUTHERN ISLANDS SI CENTRAL REGION
10 SENTOSA SISZ01 SOUTHERN ISLANDS SI CENTRAL REGION
REGION_C geometry
1 CR MULTIPOLYGON (((33222.98 29...
2 CR MULTIPOLYGON (((28481.45 30...
3 CR MULTIPOLYGON (((28087.34 30...
4 WR MULTIPOLYGON (((14557.7 304...
5 CR MULTIPOLYGON (((29542.53 31...
6 CR MULTIPOLYGON (((35279.55 30...
7 WR MULTIPOLYGON (((15772.59 21...
8 WR MULTIPOLYGON (((19843.41 21...
9 CR MULTIPOLYGON (((30870.53 22...
10 CR MULTIPOLYGON (((26879.04 26...
st_read()function of sf package is used to import the shapefile into R as sf data frame.st_transform()function of sf package is used to transform the projection to crs 3414.
15.5 Geospatial data wrangling
15.5.1 Combining Busstop and mpsz
Code chunk below populates the planning subzone code (i.e. SUBZONE_C) of mpsz sf data frame into busstop sf data frame. The output of st_intersection() is a point sf object. We do not need and therefore will drop the geometry. The number of observations reduced from 5,161 to 5,156 after operation, suggesting that 5 bus stops have been dropped as their point geometry is not within the polygon boundary of sf df mpsz.
st_intersection()is used to perform point and polygon overly and the output will be in point sf object.select()of dplyr package is then use to retain only BUS_STOP_N and SUBZONE_C in the busstop_mpsz sf data frame.five bus stops are excluded in the resultant data frame because they are outside of Singapore boundary.
Save the output into rds format
Next, we are going to append the planning subzone code from busstop_mpsz data frame onto odbus6_9 data frame. By doing so, we get the fields ‘ORIGIN_BS’, ‘DESTIN_BS” and ’ORIGIN_SZ’ all in a df .
# A tibble: 26 × 3
BUS_STOP_N SUBZONE_C LOC_DESC
<chr> <chr> <chr>
1 11009 QTSZ01 Ghim Moh Ter
2 11009 QTSZ01 GHIM MOH TER
3 82221 GLSZ05 BLK 3A
4 82221 GLSZ05 Blk 3A
5 22501 JWSZ09 Blk 662A
6 22501 JWSZ09 BLK 662A
7 96319 TMSZ05 Yusen Logistics
8 96319 TMSZ05 YUSEN LOGISTICS
9 43709 BKSZ07 BLK 644
10 43709 BKSZ07 BLK 644
# ℹ 16 more rows
The join columns will be ‘ORIGIN_PT_CODE’ from odbus6_9 df and ‘BUS_STOP_N’ from busstop_mpsz df. The columns will also be renamed.
Before left_join, odbus6_9 has 241,503 rows, after left join od_data has 242,235 rows.
Check for duplicate for proceeding
# A tibble: 794 × 5
ORIGIN_BS DESTIN_BS TRIPS ORIGIN_SZ LOC_DESC
<chr> <fct> <dbl> <chr> <chr>
1 43709 43009 15 BKSZ07 BLK 644
2 43709 43009 15 BKSZ07 BLK 644
3 43709 43419 42 BKSZ07 BLK 644
4 43709 43419 42 BKSZ07 BLK 644
5 43709 43469 1 BKSZ07 BLK 644
6 43709 43469 1 BKSZ07 BLK 644
7 43709 43479 62 BKSZ07 BLK 644
8 43709 43479 62 BKSZ07 BLK 644
9 43709 43489 23 BKSZ07 BLK 644
10 43709 43489 23 BKSZ07 BLK 644
# ℹ 784 more rows
Remove the duplicated records. The od_data df reduced from 242,235 rows to 241,838 rows after moving duplicates.
Double check again
# A tibble: 0 × 5
# ℹ 5 variables: ORIGIN_BS <chr>, DESTIN_BS <fct>, TRIPS <dbl>,
# ORIGIN_SZ <chr>, LOC_DESC <chr>
Print the current od_data df to see what we are still lacking. We are will missing the destination subzone codes.
| ORIGIN_BS | DESTIN_BS | TRIPS | ORIGIN_SZ | LOC_DESC |
|---|---|---|---|---|
| 01012 | 01112 | 276 | RCSZ10 | HOTEL GRAND PACIFIC |
| 01012 | 01113 | 143 | RCSZ10 | HOTEL GRAND PACIFIC |
| 01012 | 01121 | 66 | RCSZ10 | HOTEL GRAND PACIFIC |
Again, get the destination subzone code for each destination bus stops by performing a left_join again with busstop_mpsz (contains subzone_c codes for each bus stop id).
After left_join, the number of rows increased from 241,838 rows to 242,831 rows.
Check for duplicates
# A tibble: 880 × 7
ORIGIN_BS DESTIN_BS TRIPS ORIGIN_SZ LOC_DESC.x SUBZONE_C LOC_DESC.y
<chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 01013 51071 1 RCSZ10 ST JOSEPH'S CH CCSZ01 MACRITCHI…
2 01013 51071 1 RCSZ10 ST JOSEPH'S CH CCSZ01 MACRITCHI…
3 01112 51071 65 RCSZ10 OPP BUGIS STN EXIT C CCSZ01 MACRITCHI…
4 01112 51071 65 RCSZ10 OPP BUGIS STN EXIT C CCSZ01 MACRITCHI…
5 01112 53041 5 RCSZ10 OPP BUGIS STN EXIT C BSSZ01 Upp Thoms…
6 01112 53041 5 RCSZ10 OPP BUGIS STN EXIT C BSSZ01 Upp Thoms…
7 01121 51071 3 RCSZ04 STAMFORD PR SCH CCSZ01 MACRITCHI…
8 01121 51071 3 RCSZ04 STAMFORD PR SCH CCSZ01 MACRITCHI…
9 01239 51071 22 KLSZ09 SULTAN PLAZA CCSZ01 MACRITCHI…
10 01239 51071 22 KLSZ09 SULTAN PLAZA CCSZ01 MACRITCHI…
# ℹ 870 more rows
Remove duplicates
Sneak peak of the current od_data
| ORIGIN_BS | DESTIN_BS | TRIPS | ORIGIN_SZ | LOC_DESC.x | SUBZONE_C | LOC_DESC.y |
|---|---|---|---|---|---|---|
| 01012 | 01112 | 276 | RCSZ10 | HOTEL GRAND PACIFIC | RCSZ10 | OPP BUGIS STN EXIT C |
| 01012 | 01113 | 143 | RCSZ10 | HOTEL GRAND PACIFIC | DTSZ01 | BUGIS STN EXIT B |
| 01012 | 01121 | 66 | RCSZ10 | HOTEL GRAND PACIFIC | RCSZ04 | STAMFORD PR SCH |
The code chunk below will do the following:
Renames the destination ‘SUBZONE_C’ to ‘DESTIN_SZ’.
There are missing subzone codes for some of the origin and destination bus stop because the bus stops location in July 2023 could be more outdated than August bus stop 2023. We will drop columns with missing values.
Group-by origin subzone and destination subzone to generate a new field ‘MORNING_PEAK’ that contains the summation of all trips from subzone A to subzone B.
Take a look at our final od_data df
| ORIGIN_SZ | DESTIN_SZ | MORNING_PEAK |
|---|---|---|
| TMSZ02 | TMSZ02 | 350755 |
| WDSZ03 | WDSZ03 | 239791 |
| JWSZ08 | JWSZ09 | 234343 |
| BDSZ04 | BDSZ04 | 217917 |
| JWSZ09 | JWSZ09 | 153055 |
| YSSZ03 | YSSZ01 | 152800 |
| JWSZ04 | JWSZ04 | 149110 |
| TMSZ03 | TMSZ02 | 134196 |
| PRSZ05 | PRSZ03 | 103148 |
| JWSZ03 | JWSZ04 | 99666 |
Save the output into an rds file format.
15.6 Visualising Spatial Interaction
In this section, wewill learn how to prepare a desire line by using stplanr package.
15.6.1 Removing intra-zonal flows
We will not plot the intra-zonal flows. The code chunk below will be used to remove intra-zonal flows. It does so by removing the flows that originate and ends in the same subzone.
Rows reduced from 20,987 to 20,697.
15.6.2 Creating desire lines
In this code chunk below, od2line() of stplanr package is used to create the desire lines.
od_data1 is aspatial while mpsz is geospatial data.
Arguments
- flow
-
A data frame representing origin-destination data. The first two columns of this data frame should correspond to the first column of the data in the zones. Thus in
cents_sf(), the first column is geo_code. This corresponds to the first two columns offlow(). - zones
-
A spatial object representing origins (and destinations if no separate destinations object is provided) of travel.
- destinations
-
A spatial object representing destinations of travel flows.
- zone_code
-
Name of the variable in
zonescontaining the ids of the zone. By default this is the first column names in the zones.
The output flowLine is a sf LINESTRING object.
Simple feature collection with 20697 features and 3 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 5105.594 ymin: 25813.33 xmax: 49483.22 ymax: 49552.79
Projected CRS: SVY21 / Singapore TM
First 10 features:
ORIGIN_SZ DESTIN_SZ MORNING_PEAK geometry
1 AMSZ01 AMSZ02 11742 LINESTRING (29501.77 39419....
2 AMSZ01 AMSZ03 14886 LINESTRING (29501.77 39419....
3 AMSZ01 AMSZ04 3237 LINESTRING (29501.77 39419....
4 AMSZ01 AMSZ05 9349 LINESTRING (29501.77 39419....
5 AMSZ01 AMSZ06 2231 LINESTRING (29501.77 39419....
6 AMSZ01 AMSZ07 1714 LINESTRING (29501.77 39419....
7 AMSZ01 AMSZ08 2624 LINESTRING (29501.77 39419....
8 AMSZ01 AMSZ09 2371 LINESTRING (29501.77 39419....
9 AMSZ01 AMSZ10 183 LINESTRING (29501.77 39419....
10 AMSZ01 AMSZ11 930 LINESTRING (29501.77 39419....
15.6.3 Visualising the desire lines
To visualise the resulting desire lines, the code chunk below is used.
Arguments of tm_lines():
lwd: line width. Either a numeric value or a data variable. In the latter case, the class of the highest values (see style) will get the line width defined by scale. If multiple values are specified, small multiples are drawn (see details).
style: method to process the color scale when col is a numeric variable. Discrete gradient options are "cat", "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", "jenks", "dpih", "headtails", and "log10_pretty". A numeric variable is processed as a categorical variable when using "cat", i.e. each unique value will correspond to a distinct category
scale: line width multiplier number.
n: preferred number of color scale classes. Only applicable when lwd is the name of a numeric variable.
tm_shape(mpsz) +
tm_polygons() +
flowLine %>%
tm_shape() +
tm_lines(lwd = 'MORNING_PEAK',
style = 'quantile',
scale= c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha= 0.3)
Rendering process takes about 1 min because of the transparency argument alpha.
When the flow data are very messy and highly skewed like the one shown above, it is wiser to focus on selected flows, for example flow greater than or equal to 5000 as shown below.
tmap_mode('view')
tmap_options(check.and.fix = TRUE)
tm_shape(mpsz) +
tm_polygons() +
flowLine %>%
filter(MORNING_PEAK >= 5000) %>%
tm_shape() +
tm_lines(lwd = "MORNING_PEAK",
style = "quantile",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3)Summaries
OD matrix is often incomplete. Imagine trying to complete the OD matrix, it would involve us doing spatial interaction or OD surveys to find the missing values. There are 332 subzones in Singapore, and each survey is expensive,. In additon, OD matrix is constantly changing as flow patterns changes. We are trying to predict flows between origins and destinations. Flow could be thought of a function of (1) attribute of origin (propulsiveness) (2) attribute of destination (attractiveness) and (3) cost friction (like distance or transport cost or public transport stops). Assumption is that the benefits must outweigh the cost in order for flow to happen.
Gravity model takes into consideration the interaction between all origin and destination locations.
Potential model takes in consideration the interaction between a location and all other location pairs. (Good for measuring accessibility)
Retail model is commonly used by franchise like KFC / Pizza Hut to determine their area/region of service (aka delivery zones) for each outlet.
There are 4 variations in the Gravity model:
- Uncontrained: only the overall outflow is fixed and total outflow from origins = total inflow to destinations
- Origin constrained: outflow by origin is fixed.
- Destination constrained: inflow by destination is fixed.
- Doubly constrained: outflow by origin and inflow by destination is fixed.
To calculate flow from each origin to each destination, we need parameters like k, alpha, lambda and beta. The beta for distance is usually negative because we assume that there is an inverse relationship between interaction and distance, like Newtonian physics and laws of gravity.
